Abstract

In this paper I describe some results on the use of virtual processor
technology to parallelize SPMD computational programs. The technology
tested is Intel Hyper-Threading on real processors, and the programs
are MATLAB scripts for floating-point computation. The conclusions of
the work concern the utility and the limits of this approach. The main
result is that using virtual processors is a good technique for
improving parallel programs, not only for memory-based computations
but also in the case of massive disk-storage operations.

1 Introduction 



Processor virtualization technology makes it possible to split a
physical processor into two virtual chips, so that the operating system
of a computer, such as MS Windows or Linux, can use the virtual
processors as two real chips. An example of such technology is Intel's
Hyper-Threading [1]. The hardware can thus be regarded as a symmetric
multiprocessor machine, and the software can use it as a true parallel
environment.

In this work I show some results obtained with parallel computations
using Matlab [2] programs on Intel technology. The physical and logical
characteristics of the machine used are presented in the following
tables:



Hardware
  Type               HP Compaq ProLiant DL380
  Processors         2 Intel Xeon 2.40 GHz
  RAM                1.5 GB
  Storage            6 SCSI disks 36.5 GB - RAID 5

Software
  Operating System   MS Windows Server 2000
  Matlab             V. 5.3



The Matlab programs used for these experiments were based on loops of
floating-point computations.

2 The parallel Matlab environment 

The Matlab package has no native support for parallel processing or
multithreading [3]. There are, however, some extensions, in the form of
tools and libraries [4], for using a parallel environment on
multiprocessor hardware. On the dual-processor machine I preferred to
split a given computation across multiple instances of the Matlab
runtime. A single master instance starts the slave copies and assigns
to each of them the same set of instructions on different sets of data.
In this way I have simulated an SPMD computation on the machine.

The resulting parallel environment is simple, because there is no need
for external libraries or calls to interfaces, and flexible, because a
single slave copy can be assigned a different set of instructions in
order to realize an MPMD computation.

With this method, the exchange of messages among independent processes
is a problem. The only way to communicate from one Matlab copy to
another is through shared files. In a second type of experiment I show
that this method is not critical for the execution time if one uses
fast mass storage such as SCSI or Fibre Channel systems.

2.1 The SPMD programs



In the experiments I defined a master Matlab function that writes to a
shared file system the .m scripts to be executed by the slave Matlab
copies. These copies are launched in background mode for parallel
execution. The master program detects the end of the computations using
a simple set of lock files. Each slave finishes its work, saves the
results to files and deletes its own lock file. The master then reads
the data sets from these files for possible further computations. I now
describe the main code of the program.

This is the declaration of the function masterf. The lockarray variable
is an array used to test for the presence of the lock files during the
slaves' computation. The finalres variable is an array that collects
the partial results from the slaves. The string computing is the
mathematical expression to use in the computation.

function [elapsedtime,totaltime,executiontime]=masterf(nproc,maxvalue,step,computing)
%
% MASTERF: master function for parallel background computation.
%
% syntax:
% [elapsedtime,totaltime,executiontime]=masterf(nproc,maxvalue,step,computing)
%
% input parameters:
%
% nproc = number of processes;
% maxvalue = upper limit of the data array to process; the lower limit is 0;
% step = difference between two consecutive numbers in the data array;
% computing = the string of the mathematical expression to compute;
%
% output parameters:
%
% elapsedtime = total elapsed time to complete the execution of the computation;
% totaltime = sum of the single slaves' CPU times to complete the single computation;
% executiontime = single-slave CPU time to complete the assigned computation;

ostype=computer;
tottime=0.;
lockarray=0:nproc-1;
numbervalues=maxvalue;
computingstring=[' ' computing];
finalres=[];

After the variable workdir has been set to Matlab's working directory,
a loop writes the slaves' lock files to storage.

for i=0:nproc-1
    filelock = strcat(workdir,'filelock',int2str(i));
    fid=fopen(filelock,'w');    % 'w' creates an empty lock file
    fwrite(fid,'');
    fclose(fid);
end

In the next fragment of the program, the master builds the commands for
writing an appropriate Matlab .m script for every slave process. Each
script contains the instruction for measuring the CPU time spent on the
calculation, the expression of the mathematical computation, the
instruction to save the computed data and the CPU time to storage and,
finally, the instruction to delete the lock file.

for i=0:nproc-1
    if (i==0) middlestep=0; else middlestep=1; end
    infdata=i*(numbervalues/nproc) + middlestep*step;
    supdata=(i+1)*(numbervalues/nproc);
    fileworker = strcat(workdir,'fileworker',int2str(i),'.m');
    commandworkertmp = ...
        strcat('x=',num2str(infdata),':',num2str(step),':',num2str(supdata),...
        '; t1=cputime; ',computingstring,...
        '; t2=cputime-t1; save out',int2str(i));
    commandworker = ['cd ' workdir '; ' commandworkertmp ...
        ' y t2; ' 'delete filelock' int2str(i) '; exit;'];
    fid = fopen(fileworker,'wt');
    fwrite(fid,commandworker);
    fclose(fid);
end
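
As a worked illustration of the partition computed in the loop above,
the following minimal sketch lists the sub-range assigned to each
slave. The sample values (nproc = 4, maxvalue = 10000, step = 0.001)
and the arrays infdata and supdata used to collect the limits are
assumptions chosen for illustration; they are not part of the original
program.

```matlab
% Sketch: sub-ranges assigned to each slave (sample values are assumptions).
nproc = 4;          % number of slave processes
maxvalue = 10000;   % upper limit of the data array
step = 0.001;       % spacing between consecutive values

infdata = zeros(1,nproc);   % lower limit of each slave's range
supdata = zeros(1,nproc);   % upper limit of each slave's range
for i = 0:nproc-1
    % Every slave after the first starts one step past the previous
    % slave's upper limit, so that no value is computed twice.
    if i == 0
        middlestep = 0;
    else
        middlestep = 1;
    end
    infdata(i+1) = i*(maxvalue/nproc) + middlestep*step;
    supdata(i+1) = (i+1)*(maxvalue/nproc);
    fprintf('slave %d: x = %.3f:%.3f:%.3f\n', ...
            i, infdata(i+1), step, supdata(i+1));
end
```

With these sample values the first slave receives the range from 0 to
2500 and the second starts at 2500.001, one step past the first slave's
upper limit.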

After the instructions that record the CPU time and start the
elapsed-time counter (tic) of the master program, a loop launches the
slave Matlab runtimes. Under a Windows operating system, the
startcommand string is "start", an OS command for running an executable
program in the background, and the osstring string is "dos". Under a
Unix-like operating system, the strings are "sh" and "unix"
respectively. Each slave immediately executes its fileworker script, as
specified by the Matlab "-r" parameter.

t1 = cputime;
tic;
for i=0:nproc-1
    fileworker = strcat('fileworker',int2str(i));
    commandrun = [startcommand ' matlab -minimize -r ' fileworker];
    eval(strcat([osstring, '(''', commandrun, ''');']));
end
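
Because the quoting inside the eval call is easy to get wrong, the
following minimal sketch only builds and displays the command strings,
without launching any process. It uses the startcommand and osstring
values described in the text; selecting them with ispc and storing the
results in the cell array commands are assumptions made for
illustration, since the original program relies on its ostype variable.

```matlab
% Sketch: build (but do not run) the command that launches each slave.
nproc = 2;                      % sample value, an assumption
if ispc
    startcommand = 'start';     % Windows: run a program in the background
    osstring = 'dos';
else
    startcommand = 'sh';        % Unix-like systems, as stated in the text
    osstring = 'unix';
end

commands = cell(1,nproc);
for i = 0:nproc-1
    fileworker = ['fileworker' int2str(i)];
    commandrun = [startcommand ' matlab -minimize -r ' fileworker];
    % Doubling a single quote inside a string yields one literal quote,
    % so the evaluated text has the form osstring('commandrun');
    commands{i+1} = [osstring '(''' commandrun ''');'];
    disp(commands{i+1});
end
```

Passing one of these strings to eval would execute the corresponding
dos or unix call, as the master program does.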

In the next fragment of code the master program executes a loop to
detect the end of the slaves' computations. It checks whether the
lockarray variable still contains any non-negative process rank. If so,
it attempts to open the corresponding lock file; if the file still
exists, the master closes it, otherwise the corresponding lockarray
position is set to -1. The pause instruction is useful for avoiding an
excessive polling frequency, and hence high CPU-time consumption, in
the "while" loop.

lockarraytmp=find(lockarray > -1);
while (length(lockarraytmp) > 0)
    pause(.1);
    for i=lockarraytmp
        fid = fopen(strcat('filelock',int2str(i-1)),'r');
        if (fid < 0)
            lockarray(i) = -1;
        else
            fclose(fid);
        end
    end
    lockarraytmp=find(lockarray > -1);
end
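
The polling scheme can be tried without launching any Matlab copies.
The following minimal sketch simulates the slaves by deleting one lock
file on each pass of the loop; the scratch directory obtained from
tempname, the counter polls and the simulated deletion are assumptions
introduced only for this demonstration.

```matlab
% Sketch: lock-file polling with simulated slaves (no Matlab copies launched).
nproc = 3;                               % sample value, an assumption
workdir = tempname;                      % scratch directory for the demo
mkdir(workdir);
for i = 0:nproc-1
    fid = fopen(fullfile(workdir, ['filelock' int2str(i)]), 'w');
    fclose(fid);
end

lockarray = 0:nproc-1;                   % ranks of the slaves still running
polls = 0;
pending = find(lockarray > -1);
while ~isempty(pending)
    pause(0.1);                          % limit the polling frequency
    polls = polls + 1;
    % Stand-in for a slave finishing: remove one lock file per pass.
    delete(fullfile(workdir, ['filelock' int2str(polls-1)]));
    for i = pending
        fid = fopen(fullfile(workdir, ['filelock' int2str(i-1)]), 'r');
        if fid < 0
            lockarray(i) = -1;           % lock gone: slave i-1 has finished
        else
            fclose(fid);                 % lock still present: keep waiting
        end
    end
    pending = find(lockarray > -1);
end
rmdir(workdir);
fprintf('all %d slaves finished after %d polls\n', nproc, polls);
```

Since one lock file disappears per pass, the loop ends after as many
passes as there are simulated slaves.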

At the end, the master reads the slaves' partial computation outputs
and stores them in an array. At this point the master CPU time and the
elapsed time are also recorded. The total execution time is defined as
the sum of the slaves' computation CPU times, and is useful for
comparison with the execution time in the case nproc = 1. The
single-slave execution time is defined as the arithmetic mean of all
the partial execution times.

for i=0:nproc-1
    partialres = load(strcat('out',int2str(i)));
    finalres = [finalres partialres];
end

elapsedtime = toc;
totaltime = cputime - t1;

for i=0:nproc-1
    fps = load(strcat('out',int2str(i)));
    tottime = tottime + fps.t2;
    executiontime = tottime/nproc;
end
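
The merging of the partial results and the aggregation of the timings
can be sketched in a self-contained way as follows. The result files
are written by the same script rather than by separate slaves, and the
scratch directory, the data ranges and the expression cos(sin(x)) are
stand-ins assumed for illustration, not the values used in the
experiments.

```matlab
% Sketch: write and re-read per-slave result files, then aggregate timings.
nproc = 2;                                % sample value, an assumption
workdir = tempname;                       % scratch directory for the demo
mkdir(workdir);
for i = 0:nproc-1
    x = linspace(i, i+1, 1000);           % small stand-in data range
    t1 = cputime;
    y = cos(sin(x));                      % stand-in expression
    t2 = cputime - t1;
    save(fullfile(workdir, ['out' int2str(i) '.mat']), 'y', 't2');
end

finalres = [];
tottime = 0;
for i = 0:nproc-1
    fps = load(fullfile(workdir, ['out' int2str(i) '.mat']));
    finalres = [finalres fps.y];          % merge the partial results
    tottime = tottime + fps.t2;           % sum of the slaves' CPU times
end
executiontime = tottime/nproc;            % mean CPU time per slave

delete(fullfile(workdir, 'out*.mat'));
rmdir(workdir);
fprintf('%d values merged, mean CPU time per slave %.4f s\n', ...
        numel(finalres), executiontime);
```

Here load returns a structure whose fields are the saved variables, so
the sketch merges the field y directly, whereas the master program
above concatenates the loaded structures themselves.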



3 Tests and results 

For the tests I used the following values for the masterf parameters:

nproc: from 1 to 8;
maxvalue: m * 10000, where m = 1, 2, 3;
step: 0.001;
computing: y = 5432.060708 * cos((sin(x^[...]))^[...]), where [...]
marks a numeric exponent that is illegible in the source.

I have also tested the program without the slaves saving their partial
results and without the master's final load, in order to determine the
influence of the storage I/O operations on the execution times.

The following are the results obtained with four tests for every type
of experiment. The values are arithmetic means rounded to two decimals
and are expressed in seconds; the numbers from 1 to 8 are the values of
nproc, while "m" is the factor in the maxvalue expression.
I have not reported the elapsed times, because they did not differ from
the recorded CPU times, probably because the server was dedicated
solely to the computations during the experiments.



Tables of results. 

All the values are expressed in seconds. 



1.a Mean execution CPU times per process, no data storage:

  m \ nproc        1        2        3        4        5        6        7        8
  1            39.32    20.89    1?.?0    11.19    11.?0     8.73     9.1?     7.29
  2            77.56    40.83    29.20    23.41    22.31    23.19    17.93    19.02
  3           137.75    69.70    51.67    35.28    34.91    36.98    32.46    34.86

  (? marks a digit that is illegible in the source.)



2.a Total execution CPU times, no data storage:

  m \ nproc        1        2        3        4        5        6        7        8
  1            41.01    22.11    16.19    13.67    18.42    17.10    18.17    16.20
  2            78.40    41.89    30.74    24.81    26.78    34.69    33.68    32.95
  3           139.05    75.66    59.38    38.83    40.28    49.69    48.98    48.27



1.b Mean execution CPU times per process, with data storage:

  m \ nproc        1        2        3        4        5        6        7        8
  1            42.78    20.59    15.99    14.61    11.05     8.89    11.56    10.94
  2            99.93    52.66    38.59    25.22    20.71    18.60    18.55    18.70
  3           151.03    80.49    57.33    39.65    28.83    36.03    44.16    44.76

2.b Total execution CPU times, with data storage:

  m \ nproc        1        2        3        4        5        6        7        8
  1            49.55    28.70    26.69    27.26    17.98    18.92    17.78    17.76
  2                ?    08.50    51.33    10.53    38.00    38.37    30.7?    10.09
  3           201.65   102.90    75.03    58.94    60.97    66.52    67.92    67.83

  (? marks a value or digit that is illegible in the source. The m=2
  row is damaged there: its entries are given as read, and their
  leading digits and column positions are uncertain.)



4 Analysis of results 

From the results of the previous section, I make the following
observations:

1. In table 1.a the gain in execution speed is good from 1 to 4
processes, while from 5 to 8, and in particular in the case m=3, the
gain is low; this may be due to the excess load on the dual-processor
machine when nproc > 4;

2. In the same table the speedup [5] is quasi-linear from 1 to 4
processes, hence the algorithm and the parallel environment used are an
optimized SPMD implementation if one is interested only in pure
computation time;

3. In table 2.a the value nproc=4 gives the best performance in terms
of total execution time; hence, if one is interested in the time spent
by the master to verify when the slaves finish their job, the four
virtual processors provided by Hyper-Threading technology show the best
efficiency in the case of four running processes;

4. Excluding the case nproc=1, when the master must verify only a
single process, in the case m=3 the difference between the time rows of
2.a and 1.a is smallest for nproc=4; hence in this case too the total
time recorded by the master is optimized with respect to the slaves'
execution time;

5. In the case of data storage, the preceding conclusions are not so
clear, probably because the reading and writing of small files is in
general optimized by RAID 5 technology on a multi-disk system; this
seems to be confirmed by the large difference, 50 seconds, between the
times recorded in the case m=3 and nproc=1 in tables 1.b and 2.b;

6. In table 2.b one can note that for m=3 the best result is in the
case nproc=4;

7. In tables 1.b and 2.b, in the case m=3 the difference between the
time rows is smallest in the case nproc=4 too;

8. From table 2.b, one can note that for m=1 the best speedup (8
processes) with respect to the case nproc=1 is 2.78, while for m=3 the
best (4 processes) is 3.42; hence the virtual processors seem to
perform better with a large amount of data to compute, probably because
in this case the parallel computations have greater weight relative to
the physical operations on the storage system.
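
The speedup figures quoted in these observations can be recomputed
from the tables as S(n) = T(1)/T(n). The following minimal sketch does
so for the m=3 row of table 1.a and for the m=1 and m=3 rows of table
2.b; the variable names and the efficiency measure S(n)/n are
assumptions introduced for illustration.

```matlab
% Sketch: speedup S(n) = T(1)/T(n) from rows of the tables of results.
nproc = 1:8;
t1a_m3 = [137.75 69.70 51.67 35.28 34.91 36.98 32.46 34.86];   % table 1.a, m=3
t2b_m1 = [49.55 28.70 26.69 27.26 17.98 18.92 17.78 17.76];    % table 2.b, m=1
t2b_m3 = [201.65 102.90 75.03 58.94 60.97 66.52 67.92 67.83];  % table 2.b, m=3

s1a_m3 = t1a_m3(1) ./ t1a_m3;     % speedup with respect to nproc = 1
s2b_m1 = t2b_m1(1) ./ t2b_m1;
s2b_m3 = t2b_m3(1) ./ t2b_m3;
eff1a_m3 = s1a_m3 ./ nproc;       % parallel efficiency S(n)/n

[best1, n1] = max(s2b_m1);        % best speedup and its process count
[best3, n3] = max(s2b_m3);
fprintf('table 1.a, m=3: speedup %.2f at nproc=4 (efficiency %.2f)\n', ...
        s1a_m3(4), eff1a_m3(4));
fprintf('table 2.b, m=1: best speedup %.2f at nproc=%d\n', best1, n1);
fprintf('table 2.b, m=3: best speedup %.2f at nproc=%d\n', best3, n3);
```

The best values found for table 2.b occur at 8 and 4 processes and
agree, up to rounding, with the figures 2.78 and 3.42 of observation 8,
while the speedup of about 3.9 at nproc=4 in table 1.a illustrates the
quasi-linear behaviour noted in observation 2.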

4.1 Conclusions 

From the previous facts one can deduce that a virtual processor
technology such as Hyper-Threading can be a good choice for running
SPMD programs provided that

• the number of parallel processes is equal to the number of virtual
processors;

• the amount of data to be computed is large, particularly when its
distribution among the processes and the merging of the final results
are based on files stored on a fast storage system.